Structured Indexing Model for Cross-Language Information Retrieval

Authors

  • Chedi Bechikh Ali
  • Hatem Haddad
Abstract

In recent digital library systems and on the World Wide Web, parallel corpora are used by many applications (natural language processing, machine translation, terminology extraction, etc.). This paper presents a new cross-language information retrieval model based on language modeling. The model avoids query and/or document translation and the use of external resources. It proposes a structured indexing schema for multilingual documents that combines a keyword model and a keyphrase model. Applied to parallel collections, a query in one language can retrieve documents in the same language as well as documents in other languages. Promising results are reported on the MuchMore parallel collection (German and English).
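The abstract gives no formulas, so the following is only a minimal sketch of how a keyword language model and a keyphrase language model might be combined over structured multilingual index units. Everything in it is an assumption made for illustration, not the authors' method: the Jelinek-Mercer smoothing, the weights `lam` and `alpha`, the unit layout (one unit per parallel document pair, holding terms from both languages), and the toy data.

```python
import math
from collections import Counter


def log_likelihood(query, doc, collection, lam=0.5):
    """Log query likelihood with Jelinek-Mercer smoothing (an assumed choice).

    lam must stay below 1 so that collection statistics keep p above zero.
    """
    doc_tf, col_tf = Counter(doc), Counter(collection)
    score = 0.0
    for term in query:
        p_col = col_tf[term] / len(collection) if collection else 0.0
        if p_col == 0.0:
            continue  # term unseen in the whole collection: skip it
        p_doc = doc_tf[term] / len(doc) if doc else 0.0
        score += math.log(lam * p_doc + (1 - lam) * p_col)
    return score


def rank(query_words, query_phrases, units, alpha=0.5):
    """Rank units by a weighted sum of keyword and keyphrase scores."""
    all_words = [w for u in units for w in u["keywords"]]
    all_phrases = [p for u in units for p in u["keyphrases"]]
    scored = []
    for u in units:
        s = alpha * log_likelihood(query_words, u["keywords"], all_words)
        s += (1 - alpha) * log_likelihood(
            query_phrases, u["keyphrases"], all_phrases
        )
        scored.append((s, u["id"]))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [uid for _, uid in scored]


# Hypothetical index: each unit merges an English document with its
# German counterpart, so no translation step is needed at query time.
units = [
    {
        "id": "A",
        "keywords": ["heart", "surgery", "patient",
                     "herz", "chirurgie", "patient"],
        "keyphrases": ["heart surgery", "herz chirurgie"],
    },
    {
        "id": "B",
        "keywords": ["knee", "pain", "patient",
                     "knie", "schmerz", "patient"],
        "keyphrases": ["knee pain", "knie schmerz"],
    },
]

# A German query ranks unit A first, which also surfaces its English text.
ranking = rank(["herz", "chirurgie"], ["herz chirurgie"], units)
print(ranking)
```

Because each unit holds both language versions, a query in either language matches its own-language terms and the whole unit is returned. This is one plausible reading of "structured indexing" on a parallel collection; the paper itself may structure the index differently.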

Similar resources

Domain-Specific Track CLEF 2005: Overview of Results and Approaches, Remarks on the Assessment Analysis

The domain-specific track aims at mono- and cross-language information retrieval on structured scientific data. This track studies retrieval in a domain-specific context using two social science databases: The German Indexing and Retrieval Testdatabase (GIRT) (fourth version GIRT-4: German/English pseudo-parallel corpus with identical documents) with 302,638 documents in total, and the Russian Soc...

Translation-Based Indexing for Cross-Language Retrieval

Structured queries have proven to be an effective technique for cross-language information retrieval when evidence about translation probability is not available. Query execution time is adversely impacted, however, because the full postings list for each translation is used in the computation. This paper describes an alternative approach, translation-based indexing, that improves query-time eff...

Japanese-Chinese Cross-Language Information Retrieval: An Interlingua Approach

Electronically available multilingual information can be divided into two major categories: (1) alphabetic language information (English-like alphabetic languages) and (2) ideographic language information (Chinese-like ideographic languages). The information available in non-English alphabetic languages as well as in ideographic languages (especially, in Japanese and Chinese) is growing at an i...

Garnata: An Information Retrieval System for Structured Documents based on Probabilistic Graphical Models

This paper presents Garnata, an information retrieval system for XML documents. The system is specifically designed for implementing Bayesian network-based models for structured documents. We show its architecture and its performance from the indexing and retrieval points of view, and conclude that the system is flexible and fast.

Indexing a web site with a terminology oriented ontology

This article presents a new approach to indexing a Web site. It uses ontologies and natural language techniques for information retrieval on the Internet. The main goal is to build a structured index of the Web site. This structure is given by a terminology-oriented ontology of a domain, which is chosen a priori according to the content of the Web site. First, the indexing process uses imp...

Publication date: 2016